Linear Regression

Core Concept

Linear regression models the target variable as a linear combination of the features plus an intercept (bias). For a single target (y) and feature vector (\mathbf{x}), the model is (y = \mathbf{w}^\top \mathbf{x} + b) (or (y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)). The weights define a hyperplane in feature space and are chosen to minimize a loss over the training set, typically the sum of squared errors (SSE) or mean squared error (MSE); under squared-error loss there is a unique closed-form solution (the normal equation) when the design matrix has full column rank. This is the foundational regression approach: interpretable coefficients, fast training, and a single global fit that is well understood statistically and serves as a baseline for more flexible methods.
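
As a concrete illustration, the sketch below fits this model with the normal equation on synthetic data and reads off the intercept and weights. It assumes only NumPy; the data, seed, and coefficient values are made up for the example.

```python
import numpy as np

# Illustrative synthetic data: 100 samples, 2 features (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is learned as an extra weight.
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: w = (X^T X)^{-1} X^T y (solve is used instead of an explicit inverse).
w = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print("intercept and weights:", w)   # approximately [5.0, 3.0, -2.0]

# Predictions are just the linear combination w^T x + b.
y_hat = X_design @ w
```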

Key Characteristics

  • Closed-form solution – Under squared-error loss, the optimal weights are given by the normal equation (\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}), assuming (\mathbf{X}^\top \mathbf{X}) is invertible. No iterative optimization is required for the basic formulation, though gradient descent is used for large-scale or regularized variants.
  • Interpretability – Each coefficient (\beta_j) can be read as the expected change in the target per unit change in (x_j), holding other features constant. Sign and magnitude of coefficients support feature importance and causal-style reasoning, subject to correlation and confounding.
  • Single global fit – One set of weights applies everywhere in the feature space; the model cannot capture different slopes or curvature in different regions unless features are engineered (e.g. interactions, polynomial terms) or the model is extended (e.g. piecewise linear).
  • Assumptions – Classical inference (standard errors, confidence intervals) assumes linearity, independence of errors, homoscedasticity (constant error variance), and often normality of errors. Violations affect inference more than the fitted values; robust or heteroscedasticity-consistent standard errors can relax some assumptions.
  • Regularization – Ridge (L2) and Lasso (L1) add penalties on (\mathbf{w}), shrinking coefficients or performing feature selection; they improve generalization when (p) is large or features are correlated. Ridge retains a closed-form solution (a modified normal equation), while Lasso has no closed form under the L1 penalty and is solved iteratively (e.g. coordinate descent); a short sketch contrasting the OLS and ridge closed forms follows this list.
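
The sketch below contrasts the two closed forms noted above: the OLS normal equation and the ridge solution with its (\lambda \mathbf{I}) term. The correlated synthetic data and the penalty strength are illustrative assumptions, not values from the text.

```python
import numpy as np

# Two nearly collinear features (values are made up for the example).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=200)

lam = 10.0                      # ridge penalty strength (illustrative)
I = np.eye(X.shape[1])

# OLS: (X^T X)^{-1} X^T y -- ill-conditioned when features are highly correlated.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: (X^T X + lam I)^{-1} X^T y -- still closed-form, coefficients are shrunk.
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print("OLS  :", w_ols)          # individual coefficients can be large and unstable
print("Ridge:", w_ridge)        # shrunk and far more stable across resamples
```

The intercept is omitted here because the simulated features and target are roughly centered; in practice either center the data or add an (unpenalized) intercept column.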

Common Applications

  • Demand and sales forecasting – Predicting quantity sold or revenue from price, promotion, seasonality, and other covariates
  • Housing and asset valuation – Estimating price from size, location, number of rooms, and similar attributes
  • Risk and exposure modeling – Predicting continuous risk scores or exposure levels from demographic and behavioral features
  • Trend and time-index regression – Modeling a quantity as a linear function of time or an index when the relationship is approximately linear
  • Causal and policy analysis – Estimating treatment effects or policy impacts when linearity and identification assumptions hold; coefficients support interpretable comparison across groups or conditions
  • Baseline and residual analysis – Using linear regression as a simple baseline; examining residuals to guide feature engineering or choice of more flexible models

Linear Regression Algorithms

Linear regression algorithms differ in how they minimize the loss (closed-form vs iterative), whether they apply regularization (L2, L1, or both), and in their robustness to outliers, multicollinearity, and scale. The choice depends on sample size, number of features, the need for interpretability or sparsity, and whether uncertainty estimates are required. Short code sketches for several of these estimators follow the list.

  • Ordinary Least Squares (OLS) – Minimizes sum of squared errors via the normal equation (\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}); closed-form, unique solution when (\mathbf{X}^\top \mathbf{X}) is invertible; interpretable coefficients and standard errors under classical assumptions.

  • Ridge Regression – OLS with L2 penalty (\lambda \|\mathbf{w}\|_2^2); shrinks coefficients toward zero, improving conditioning when features are correlated or (p) is large; solution remains closed-form with ((\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}).

  • Lasso Regression – OLS with L1 penalty (\lambda \|\mathbf{w}\|_1); promotes sparse solutions (some coefficients exactly zero), performing feature selection; solved iteratively (e.g. coordinate descent) since no closed-form solution exists under L1.

  • Elastic Net – Combines L1 and L2 penalties; balances Ridge’s stability under correlation with Lasso’s sparsity; useful when many correlated features and group or sparse selection is desired.

  • Gradient Descent – Iteratively updates weights by moving in the direction that reduces MSE (or other loss); used when the design matrix is too large for the normal equation or when combined with regularization that has no closed-form.

  • Stochastic Gradient Descent (SGD) – Gradient descent using a single example or mini-batch per update; scales to very large (n); requires tuning learning rate and schedule; sklearn’s SGDRegressor supports MSE, Huber, and epsilon-insensitive losses with optional L2/L1 penalty.

  • Huber Regression – Minimizes a loss that is quadratic for small errors and linear for large ones; more robust to outliers than MSE while remaining differentiable; solved iteratively (e.g. iteratively reweighted least squares or SGD).

  • RANSAC (RANdom SAmple Consensus) – Fits a linear model to random subsets of the data and keeps the model with the most inliers; robust to a large fraction of outliers; useful when the data may contain many anomalous points.

  • Bayesian Linear Regression – Places a prior on the weights and updates to a posterior given the data; provides predictive uncertainty (e.g. posterior predictive distribution) and principled handling of regularization via the prior; closed-form under Gaussian prior and likelihood.
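
The sketches below illustrate a few of the estimators listed above, using scikit-learn and NumPy; all data, penalty strengths, and hyperparameters are illustrative assumptions rather than recommendations. First, the regularized variants: ridge shrinks all coefficients, while Lasso and Elastic Net drive some exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Illustrative data: 10 features, only 3 of which actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
true_w = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=300)

models = {
    "ols":         LinearRegression(),
    "ridge":       Ridge(alpha=1.0),                    # L2: shrinks, keeps all features
    "lasso":       Lasso(alpha=0.1),                    # L1: sets some coefficients to zero
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of L1 and L2
}
for name, model in models.items():
    model.fit(X, y)
    nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name:12s} nonzero coefficients: {nonzero}")
```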
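
Next, a plain batch gradient-descent loop for the MSE loss, as described in the Gradient Descent and SGD entries; the learning rate and iteration count are illustrative and would normally be tuned (or replaced by sklearn's SGDRegressor for large data).

```python
import numpy as np

# Synthetic data (values are made up for the example).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=500)

n, p = X.shape
w = np.zeros(p)   # weights
b = 0.0           # intercept
lr = 0.1          # learning rate (illustrative)

for _ in range(500):
    residual = X @ w + b - y                 # prediction error for every sample
    grad_w = (2.0 / n) * (X.T @ residual)    # gradient of MSE w.r.t. the weights
    grad_b = (2.0 / n) * residual.sum()      # gradient of MSE w.r.t. the intercept
    w -= lr * grad_w
    b -= lr * grad_b

print("weights:", w, "intercept:", b)        # approaches [1.5, -2.0, 0.5] and 4.0
```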
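
For the robust options (Huber and RANSAC entries above), the sketch below compares slopes on data with a corrupted subset of targets; the outlier fraction and magnitudes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

# One feature, true slope 2.0, with ~10% of targets corrupted by large outliers.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=200)
outlier_idx = rng.choice(200, size=20, replace=False)
y[outlier_idx] += 30.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)                  # quadratic near zero, linear in the tails
ransac = RANSACRegressor(random_state=0).fit(X, y)  # default base estimator is OLS

print("OLS slope   :", ols.coef_[0])                # pulled upward by the outliers
print("Huber slope :", huber.coef_[0])              # much closer to the true slope of 2
print("RANSAC slope:", ransac.estimator_.coef_[0])  # fit on the consensus inlier set
```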
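
Finally, a conjugate Bayesian linear regression in closed form, as in the last entry above: a Gaussian prior on the weights and Gaussian noise give a Gaussian posterior and a predictive variance. The noise and prior variances are assumed known here purely for illustration (sklearn's BayesianRidge estimates them from the data instead).

```python
import numpy as np

# Synthetic data (values are made up for the example).
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.2, size=50)

sigma2 = 0.2 ** 2   # noise variance (assumed known, illustrative)
tau2 = 1.0          # prior variance of each weight (illustrative)

# Posterior over weights: N(m, S) with
#   S = (X^T X / sigma^2 + I / tau^2)^{-1}   and   m = S X^T y / sigma^2.
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
m = S @ X.T @ y / sigma2

# Posterior predictive at a new point x*: mean m^T x*, variance x*^T S x* + sigma^2.
x_new = np.array([0.5, 2.0])
pred_mean = m @ x_new
pred_std = np.sqrt(x_new @ S @ x_new + sigma2)
print("posterior mean weights:", m)
print(f"prediction: {pred_mean:.3f} +/- {pred_std:.3f}")
```

Note that with this prior the posterior mean coincides with the ridge solution for (\lambda = \sigma^2 / \tau^2), which is how the prior plays the role of regularization.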